Results: Donor Importance

Show tabs for Years 1,3,5 (and 90d?)

Save data so can load in model_data and abnormal_data

Calibration Plots are essentially the same as what I’m calling unexplained effects

rerun data and save data files (as csv or rds)

Causal Forests

Add back in PTR_SEQUENCE_NUMBER! This is useful to say doesn’t influence outcomes!

Outliers in TBILI_CAND_TX. Truncate at 40?

1 MODEL FIXES

1.1 functions.R

  • add … to model_refit() for chdy models since no tuning parameters
  • add importance = impurity to rf models in model_refit()

2 data split

performance dropped when I used the TX_YEAR nested cv. This is probably because the training data was smaller. Originaly, I held out 200 observations, but in this strategy the holdout data was up to 450 or so.

3 TODO:

  1. Literature review: all existing models that use donor variables.
    • in peds
    • also in adults Argument: the donor variables may not influence survival enough to consider.

3.1 Notes

  1. This is using the classification scenario like used in Miller et al (Ped Tx, 2019). But may want to compare with SRTR model https://www.srtr.org/tools/posttransplant-outcomes/ that is based on Cox-PH model (classification vs. survival)

  2. Use Repeated/Nested CV to reduce variability in the:

    1. tuning parameters
    2. variable importance scores
    3. test performance
    4. calibration/diagnostic assessment
  3. Because we are removing censored observations, the actual survival rate is higher. Our estimates are biased. We don’t adjust because our goal is evaluating donor variables, not actual survival modeling.

3.2 Hypothesis

The primary hypothesis is: Given a normal echo, ischemic time < 6hrs and donor < 30 yrs, no other donor variables are related to outcomes (using current acceptance practice). The original hypothesis was: Donor Characteristics aren’t related to outcome. We switched to make the paper clearer and more direct.

3.3 Approach

  1. Remove transplants with abnormal echos.

  2. Create nested cross-validation.

    • The outer split was on TX_YEAR, so all test performance estimates are based on one-year’s worth of transplants.
    • the inner split 10-fold cross-validation using status1 (one-year survival) as stratifying variables
    • Because of censoring the counts (and proportion) of train/test are a bit different.
  3. Fit three models (RF, BT, LR) using several predictor sets (e.g., all, candidate only, donor only, candidate_comparison_ischemic) on the training data. Used 10-fold cross-validation to select tuning (or hyper) parameters. This provides: a) CV estimated predictive performance, b) the optimal tuning parameters, and c) the final model that uses the optimal tuning parameter(s) to refit with all the training data. This last step is used to obtain feature importance scores and a prediction function to compute test performance.

    • Fit a final model using the optimal tuning parameter for each metric (ROC, log-loss, and brier score).
  4. Assess donor importance using:

    1. Compare predictive performance on CV across predictor sets (for all three models)
    2. Repeat on test data
    3. Explore feature importance scores for models fit will all predictors and best models.
    4. Predictive diagnostics: any unexplained patterns in predictive residuals. Use candidate only (or best simple model).

4 Overall Survival

Rows: 116 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): type, var, description, notes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

NOTE: the 5-year survival for 2018 and 2019 is badly skewed! Don’t use for anything!!! - Need to think about including in the cross-validation.

NOTE: 2014 is an outlier. Survival drops below 90%.

Table of transplants and survival. Note: n_tx are the total number of transplants with normal echos. But the other values are based on unknown/censored outcomes being removed!

Training and Test data sizes:

5 Predictive Performance

model name
lr Logistic Regression (lasso penalty)
bt Boosted Trees (xgboost)
bt1 Boosted Stumps (GAM)
rf random forest
chdy Choudhry’s model (ECMO, Ventilator, AGE<1, eGRF, TBILI)

This shows the predictive performance by metric using the averages from predicting each of the 10 years using models developed from the other 9 years.

5.1 Nested CV Results

Log Loss

Brier Score

AUC

Number of times model/var had best performance:

Log Loss

Brier Score

AUC

Number of times model/var had best performance:

NOTE: the outcome data is censored at 5 year. This is probably not good to report this.

Log Loss

Brier Score

AUC

Number of times model/var had best performance:

5.2 Modeling Predictive Performance

This implements a mixed-effects (hierarchical) Bayesian model to capture which elements drive predictive performance. The model is:

performance ~ intercept + model_effects + var_effects + (1|resample) 

where the intercept corresponds to baseline performance with model = logistic regression and variable set = candidate only.

6 Predictive Diagnostics

Get predictions from optimal logistic regression model using candidate only variables and 1-year survival as the outcome of interest.

Use the predicted probability of survival as an offset in modeling the hold-out data. We then model the hold-out data for each predictor variable and check for significant effects which would indicate evidence that the variable should be included in the model.

  • This is especially important for checking for unaccounted donor, offer, or donor-candidate effects.

\[ \log\left( \frac{p}{1-p} \right) = \log\left( \frac{\hat{p}}{1-\hat{p}} \right) + s(x) \] Predictions:

[1] "starting split 1 of 10"
[1] "starting split 2 of 10"
[1] "starting split 3 of 10"
[1] "starting split 4 of 10"
[1] "starting split 5 of 10"
[1] "starting split 6 of 10"
[1] "starting split 7 of 10"
[1] "starting split 8 of 10"
[1] "starting split 9 of 10"
[1] "starting split 10 of 10"

6.1 Calibration of predicted probabilities

The 1-year survival models using logistic regression with candidate only variables have good calibration. Note that 90% of transplants have a predicted one year survival over 0.85.

6.2 Unexplained effects

Calibration by predictor variable (unexplained effects)

Run through all TX_YEARs and all variables. Fit a calibration model with offset as the model predictions (logit scale). Check for any statistically significant patterns.

[1] "starting split_id: 1"
[1] "starting split_id: 2"
[1] "starting split_id: 3"
[1] "starting split_id: 4"
[1] "starting split_id: 5"
[1] "starting split_id: 6"
[1] "starting split_id: 7"
[1] "starting split_id: 8"
[1] "starting split_id: 9"
[1] "starting split_id: 10"
`summarise()` has grouped output by 'var', 'metric'. You can override using the
`.groups` argument.

Combine results from all splits before modeling:

6.3 Partial Effects

The previous analysis suggests that the candidate only predictions explain the outcomes pretty well. That is, there are no residual patterns in the donor, donor-candidate, or offer variables that would offer significant improvements in predictions.

However, we know that some variables, like Ischemic Time and Donor-Candidate size compatability, should be related to outcomes. What is potentially happening is that the candidate variables are capture the effects of donor and offer variables due to associations.

Per split:

[1] "starting split_id: 1"
[1] "starting split_id: 2"
[1] "starting split_id: 3"
[1] "starting split_id: 4"
[1] "starting split_id: 5"
[1] "starting split_id: 6"
[1] "starting split_id: 7"
[1] "starting split_id: 8"
[1] "starting split_id: 9"
[1] "starting split_id: 10"
`summarise()` has grouped output by 'var'. You can override using the `.groups`
argument.
# A tibble: 408 × 5
   var                metric        min less_05 less_01
   <chr>              <chr>       <dbl>   <int>   <int>
 1 AGE_CAND           brier_class     0      10      10
 2 AGE_CAND           mn_log_loss     0      10      10
 3 AGE_CAND           roc_auc         0      10      10
 4 AGE_DON            brier_class     0      10      10
 5 AGE_DON            mn_log_loss     0      10      10
 6 AGE_DON            roc_auc         0      10      10
 7 BSA_CAND           brier_class     0      10      10
 8 BSA_CAND           mn_log_loss     0      10      10
 9 BSA_CAND           roc_auc         0      10      10
10 BSA_DON            brier_class     0      10      10
11 BSA_DON            mn_log_loss     0      10      10
12 BSA_DON            roc_auc         0      10      10
13 CAND_DIAG          brier_class     0      10      10
14 CAND_DIAG          mn_log_loss     0      10      10
15 CAND_DIAG          roc_auc         0      10      10
16 CAND_DIAG_CODE     brier_class     0      10      10
17 CAND_DIAG_CODE     mn_log_loss     0      10      10
18 CAND_DIAG_CODE     roc_auc         0      10      10
19 CAND_DIAG_full     brier_class     0      10      10
20 CAND_DIAG_full     mn_log_loss     0      10      10
21 CAND_DIAG_full     roc_auc         0      10      10
22 CAND_DIAG_primary  brier_class     0      10      10
23 CAND_DIAG_primary  mn_log_loss     0      10      10
24 CAND_DIAG_primary  roc_auc         0      10      10
25 FUNC_STAT_CAND_REG brier_class     0      10      10
26 FUNC_STAT_CAND_REG mn_log_loss     0      10      10
27 FUNC_STAT_CAND_REG roc_auc         0      10      10
28 FUNC_STAT_CAND_TX  brier_class     0      10      10
29 FUNC_STAT_CAND_TX  mn_log_loss     0      10      10
30 FUNC_STAT_CAND_TX  roc_auc         0      10      10
31 HEIGHT_CAND_CM     brier_class     0      10      10
32 HEIGHT_CAND_CM     mn_log_loss     0      10      10
33 HEIGHT_CAND_CM     roc_auc         0      10      10
34 HEIGHT_DON_CM      brier_class     0      10      10
35 HEIGHT_DON_CM      mn_log_loss     0      10      10
36 HEIGHT_DON_CM      roc_auc         0      10      10
37 INFECTION_CAND     brier_class     0      10      10
38 INFECTION_CAND     mn_log_loss     0      10      10
39 INFECTION_CAND     roc_auc         0      10      10
40 TRANSFUSIONS_CAND  brier_class     0      10      10
# ℹ 368 more rows
`summarise()` has grouped output by 'var'. You can override using the `.groups`
argument.
# A tibble: 136 × 5
   var                 metric              min less_05 less_01
   <chr>               <chr>             <dbl>   <int>   <int>
 1 AGE_CAND            mn_log_loss 0                10      10
 2 AGE_DON             mn_log_loss 0                10      10
 3 BSA_CAND            mn_log_loss 0                10      10
 4 BSA_DON             mn_log_loss 0                10      10
 5 CAND_DIAG           mn_log_loss 0                10      10
 6 CAND_DIAG_CODE      mn_log_loss 0                10      10
 7 CAND_DIAG_full      mn_log_loss 0                10      10
 8 CAND_DIAG_primary   mn_log_loss 0                10      10
 9 FUNC_STAT_CAND_REG  mn_log_loss 0                10      10
10 FUNC_STAT_CAND_TX   mn_log_loss 0                10      10
11 HEIGHT_CAND_CM      mn_log_loss 0                10      10
12 HEIGHT_DON_CM       mn_log_loss 0                10      10
13 INFECTION_CAND      mn_log_loss 0                10      10
14 TRANSFUSIONS_CAND   mn_log_loss 0                10      10
15 VENTILATOR_CAND_TX  mn_log_loss 0                10      10
16 VENT_SUPPORT_TX     mn_log_loss 0                10      10
17 WEIGHT_CAND_KG      mn_log_loss 0                10      10
18 WEIGHT_DON_KG       mn_log_loss 0                10      10
19 BMI_DON             mn_log_loss 0.00000157       10      10
20 ECMO_CAND_TX        mn_log_loss 0                 9       9
21 TBILI_CAND_TX       mn_log_loss 0                 9       9
22 MED_COND_CAND_TX    mn_log_loss 0.000000862      10       8
23 PO2_DON             mn_log_loss 0.00000352        8       8
24 VENTILATOR_CAND_REG mn_log_loss 0                 9       7
25 ALBUM_CAND_REG      mn_log_loss 0.00000434        8       7
26 eGFR_CAND_TX        mn_log_loss 0                 7       7
27 DEATH_CIRCUM_DON    mn_log_loss 0.0000605         9       6
28 ECMO_CAND_REG       mn_log_loss 0                 8       6
29 POSTERIOR_WALL      mn_log_loss 0.000000449       8       6
30 DIAL_CAND           mn_log_loss 0                 7       6
31 HEIGHT_RATIO        mn_log_loss 0.00000471        7       6
32 SEPTAL_WALL         mn_log_loss 0.00000505        7       6
33 TRANSFUS_TERM_DON   mn_log_loss 0.000269          9       5
34 ISCHTIME            mn_log_loss 0                 7       5
35 VAD_CAND_TX         mn_log_loss 0.00000522        7       5
36 RACE_CAND           mn_log_loss 0.000136          7       5
37 ABO_STR             mn_log_loss 0                 6       5
38 CREAT_CAND_TX       mn_log_loss 0.000296          6       5
39 CPRA                mn_log_loss 0                 5       5
40 CPRA_PEAK           mn_log_loss 0                 5       5
# ℹ 96 more rows

Combined over all splits

6.3.1 Plots of Partial Effects

This is like a SHAP value - how the individual variables relate to predictions. TODO: Add color to points. Survived vs. not. Do this in two calls to geom_point() so the not-survived (minority class) shows on top.

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

7 Variable Importance

7.0.1 Using “all variables”:

$bt
# A tibble: 18 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     bt    BMI_DIFF                       30          10          10      10
 2 1     bt    CAND_DIAG_Congenital.Heart…    30          10          10      10
 3 1     bt    eGFR_CAND_TX                   30          10          10      10
 4 1     bt    TBILI_CAND_TX                  30          10          10      10
 5 1     bt    CPRA                           29          10          10       9
 6 1     bt    ECMO_CAND_TX_Yes               17           7           4       6
 7 1     bt    INFECTION_CAND_Yes             10           2           4       4
 8 1     bt    PCO2_DON                        6           2           2       2
 9 1     bt    LISTING_CTR_PREV_SURVIVAL_…     5           2           2       1
10 1     bt    BMI_CAND                        4           1           2       1
11 1     bt    VENTILATOR_CAND_TX_Yes          4           0           2       2
12 1     bt    BUN_DON                         3           0           1       2
13 1     bt    HEIGHT_CAND_CM                  3           0           2       1
14 1     bt    ISCHTIME                        3           2           1       0
15 1     bt    DA1_DON                         2           2           0       0
16 1     bt    PO2_DON                         2           0           0       2
17 1     bt    CPR_DURATION_DON                1           1           0       0
18 1     bt    HEMATOCRIT_DON                  1           1           0       0

$bt1
# A tibble: 15 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     bt1   BMI_DIFF                       30          10          10      10
 2 1     bt1   CAND_DIAG_Congenital.Heart…    30          10          10      10
 3 1     bt1   CPRA                           30          10          10      10
 4 1     bt1   eGFR_CAND_TX                   30          10          10      10
 5 1     bt1   TBILI_CAND_TX                  30          10          10      10
 6 1     bt1   INFECTION_CAND_Yes             16           5           7       4
 7 1     bt1   ECMO_CAND_TX_Yes               13           5           4       4
 8 1     bt1   HEIGHT_CAND_CM                 10           4           1       5
 9 1     bt1   VENTILATOR_CAND_TX_Yes          8           2           3       3
10 1     bt1   BMI_DON                         5           2           1       2
11 1     bt1   WEIGHT_DON_KG                   3           1           2       0
12 1     bt1   TRANSFUSIONS_CAND_Yes           2           0           1       1
13 1     bt1   CPR_DURATION_DON                1           0           0       1
14 1     bt1   ISCHTIME                        1           0           1       0
15 1     bt1   POSTERIOR_WALL                  1           1           0       0

$lr
# A tibble: 12 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     lr    BMI_DIFF_poly_1                30          10          10      10
 2 1     lr    CAND_DIAG_Congenital.Heart…    30          10          10      10
 3 1     lr    CPRA                           30          10          10      10
 4 1     lr    TBILI_CAND_TX                  30          10          10      10
 5 1     lr    INFECTION_CAND_Yes             25           8           8       9
 6 1     lr    ECMO_CAND_TX_Yes               22           7           7       8
 7 1     lr    VENTILATOR_CAND_TX_Yes         13           4           4       5
 8 1     lr    DEATH_CIRCUM_DON_Natural.C…    12           5           5       2
 9 1     lr    TRANSFUSIONS_CAND_Yes           9           3           3       3
10 1     lr    RACE_CAND_Hispanic              5           2           2       1
11 1     lr    TX_DOW_Monday                   3           1           1       1
12 1     lr    HEIGHT_CAND_CM                  1           0           0       1

$rf
# A tibble: 13 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     rf    BMI_CAND                       30          10          10      10
 2 1     rf    BMI_DIFF                       30          10          10      10
 3 1     rf    eGFR_CAND_TX                   30          10          10      10
 4 1     rf    ISCHTIME                       30          10          10      10
 5 1     rf    TBILI_CAND_TX                  30          10          10      10
 6 1     rf    LISTING_CTR_PREV_SURVIVAL_…    21           7           7       7
 7 1     rf    HRS_DECEASED_AT_CLAMP          10           2           4       4
 8 1     rf    HEMATOCRIT_DON                  7           1           3       3
 9 1     rf    WEIGHT_CAND_KG                  7           3           2       2
10 1     rf    DAYS_ACTIVE                     6           2           2       2
11 1     rf    DAYS_STAT1A                     4           2           1       1
12 1     rf    WEIGHT_RATIO                    4           2           1       1
13 1     rf    BSA_CAND                        1           1           0       0

7.0.2 Using “candidate only” variables:

$bt
# A tibble: 14 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     bt    BMI_DIFF                       30          10          10      10
 2 1     bt    CAND_DIAG_Congenital.Heart…    30          10          10      10
 3 1     bt    CPRA                           30          10          10      10
 4 1     bt    eGFR_CAND_TX                   30          10          10      10
 5 1     bt    TBILI_CAND_TX                  30          10          10      10
 6 1     bt    HEIGHT_CAND_CM                 19           6           7       6
 7 1     bt    ISCHTIME                       16           5           6       5
 8 1     bt    ECMO_CAND_TX_Yes                7           2           2       3
 9 1     bt    INFECTION_CAND_Yes              6           1           2       3
10 1     bt    BMI_CAND                        4           2           1       1
11 1     bt    DAYS_STAT1A                     3           2           1       0
12 1     bt    ALBUM_CAND_REG                  2           1           1       0
13 1     bt    HLAMIS                          2           1           0       1
14 1     bt    VENTILATOR_CAND_TX_Yes          1           0           0       1

$bt1
# A tibble: 12 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     bt1   BMI_DIFF                       30          10          10      10
 2 1     bt1   CAND_DIAG_Congenital.Heart…    30          10          10      10
 3 1     bt1   CPRA                           30          10          10      10
 4 1     bt1   eGFR_CAND_TX                   30          10          10      10
 5 1     bt1   TBILI_CAND_TX                  30          10          10      10
 6 1     bt1   HEIGHT_CAND_CM                 23           7           9       7
 7 1     bt1   INFECTION_CAND_Yes             14           5           4       5
 8 1     bt1   ISCHTIME                       12           4           4       4
 9 1     bt1   ECMO_CAND_TX_Yes                5           1           2       2
10 1     bt1   VENTILATOR_CAND_TX_Yes          3           1           0       2
11 1     bt1   ALBUM_CAND_REG                  2           1           1       0
12 1     bt1   TRANSFUSIONS_CAND_Yes           1           1           0       0

$lr
# A tibble: 14 × 7
   year  model Variable                    total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>                       <dbl>       <dbl>       <dbl>   <dbl>
 1 1     lr    CAND_DIAG_Congenital.Heart…    30          10          10      10
 2 1     lr    CPRA                           30          10          10      10
 3 1     lr    TBILI_CAND_TX                  28           8          10      10
 4 1     lr    BMI_DIFF_poly_1                25           7           8      10
 5 1     lr    HEIGHT_CAND_CM                 19           7           7       5
 6 1     lr    INFECTION_CAND_Yes             19           4           7       8
 7 1     lr    ECMO_CAND_TX_Yes               16           2           5       9
 8 1     lr    RACE_CAND_Hispanic             12           6           4       2
 9 1     lr    eGFR_CAND_TX                    8           6           2       0
10 1     lr    eGFR_CAND_TX_90                 8           6           2       0
11 1     lr    VENTILATOR_CAND_TX_Yes          7           1           2       4
12 1     lr    TRANSFUSIONS_CAND_Yes           6           2           2       2
13 1     lr    ABO_CAND_AB                     1           0           1       0
14 1     lr    CMV_CAND_Positive               1           1           0       0

$rf
# A tibble: 10 × 7
   year  model Variable       total brier_class mn_log_loss roc_auc
   <chr> <chr> <chr>          <dbl>       <dbl>       <dbl>   <dbl>
 1 1     rf    BMI_CAND          30          10          10      10
 2 1     rf    BMI_DIFF          30          10          10      10
 3 1     rf    BSA_CAND          30          10          10      10
 4 1     rf    eGFR_CAND_TX      30          10          10      10
 5 1     rf    ISCHTIME          30          10          10      10
 6 1     rf    TBILI_CAND_TX     27           9           9       9
 7 1     rf    WEIGHT_CAND_KG    24           8           8       8
 8 1     rf    DAYS_ACTIVE        3           1           1       1
 9 1     rf    DAYSWAIT_CHRON     3           1           1       1
10 1     rf    HEIGHT_CAND_CM     3           1           1       1

8 Difference between model predictions

This only shows difference from same model but choosing tuning parameters differently. Good - pretty robust to tuning parameters selection

8.1 Differences across models

[1] "starting model lr"
[1] "starting model bt1"
[1] "starting model bt"
[1] "starting model rf"

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

9 Causal Forests